import numpy as np
import pandas as pd
import random
from matplotlib import pyplot as plt
import seaborn as sns
The Ras_SOCS.xlsx file has been retrieved from UniProtKB via the advanced search construction:
database:(type:pfam ras) database:(type:pfam socs_box)
All the proteins found belong to diverse bilaterians.
table1 = pd.read_excel("./Ras_SOCS.xlsx")
table1.head()
table1.shape
table1.isna().any()
table1[table1["Taxonomic lineage (FAMILY)"].isna()]
A few entries with unidentified families, all in the large class Aves, are safe to ignore: they will simply not be accounted for during across-families sampling.
table1[table1["Taxonomic lineage (CLASS)"].isna()]
It appears that all the entries with unidentified class names are either crocodiles or turtles, so they can be united under the Archosauromorpha name:
table1.loc[table1["Taxonomic lineage (CLASS)"].isna(), "Taxonomic lineage (CLASS)"] = "Archosauromorpha"
table1["Taxonomic lineage (CLASS)"].unique()
table1[table1["Cross-reference (Pfam)"] != "PF00071;PF07525;"].shape[0]
The advanced search construction used to retrieve the table1 sequences, in fact, finds all the proteins that have the two queried domains, but it pays no attention to any other domains present in those proteins. Such extra domains may be interleaved with the two target domains, and in some of the found two-domain proteins the order of the domains may even be inverted. On the other hand, a multidomain protein may still be homologous to the target two-domain proteins if the extra domain was added to the N- or C-terminus of the two-domain architecture during a recent evolutionary event.
Thus it is hard to discern non-homologues in the table. Since there are only 9 extra-domain proteins in the table and some of them may be homologous to the target architecture, it has been decided to retain them, as this should introduce no great bias.
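The "9 extra-domain proteins" figure comes from comparing each entry's Pfam cross-references against the exact target pair. A minimal sketch of that check on a toy frame (the entry names and the third Pfam ID are invented; only PF00071 and PF07525 come from the query above):

```python
import pandas as pd

# Toy stand-in for table1's "Cross-reference (Pfam)" column; the real target
# IDs are PF00071 (Ras) and PF07525 (SOCS_box), everything else is made up.
toy = pd.DataFrame({
    "Entry": ["P1", "P2", "P3"],
    "Cross-reference (Pfam)": ["PF00071;PF07525;",
                               "PF00071;PF07525;",
                               "PF00071;PF07525;PF00069;"],
})

target = {"PF00071", "PF07525"}
# An entry carries extra domains if its Pfam set differs from the target pair
extra = toy["Cross-reference (Pfam)"].apply(
    lambda s: set(filter(None, s.split(";"))) != target)
print(int(extra.sum()))  # -> 1
```

Note that a set comparison ignores domain order, so inverted-architecture proteins are not flagged by this check.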
plt.figure(figsize=[9, 7])
a = plt.hist(table1["Length"], bins=25, density=False)
plt.xticks(a[1], rotation=45, size=12)
plt.yticks(size=12)
plt.xlabel("Protein length", size=14)
plt.ylabel("Entries per bin", size=14)
plt.grid(True, axis="y")
plt.title("Protein length histogram", size=16)
plt.show()
Based on the histogram above, proteins with the length range from 200 to 330 aa are further considered as of typical length:
typical = table1[(table1["Length"] > 200) & (table1["Length"] < 330)]
typical["Taxonomic lineage (PHYLUM)"].nunique(),\
typical["Taxonomic lineage (CLASS)"].nunique(), typical["Taxonomic lineage (FAMILY)"].nunique()
typical["Taxonomic lineage (CLASS)"].value_counts()
A small sample of proteins of typical length for HMM profile building is being collected with special attention to taxonomic diversity:
sampled_AC = []
random.seed(7)
for cl in typical["Taxonomic lineage (CLASS)"].unique():
    cl_table = typical[typical["Taxonomic lineage (CLASS)"] == cl]
    if cl_table.shape[0] <= 3:
        sampled_AC += list(cl_table["Entry"])
    else:
        if cl_table["Taxonomic lineage (FAMILY)"].nunique() < 3:
            entries = list(cl_table["Entry"])
            random.shuffle(entries)
            sampled_AC += entries[:3]
        else:
            families = list(cl_table["Taxonomic lineage (FAMILY)"].unique())
            random.shuffle(families)
            families = families[:3]
            for fam in families:
                entries = list(cl_table[cl_table["Taxonomic lineage (FAMILY)"] == fam]["Entry"])
                sampled_AC.append(random.choice(entries))
random.seed()
del typical
len(sampled_AC)
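The class-stratified part of the loop above can also be expressed with a pandas groupby; a simplified sketch on an invented frame (the per-family balancing within large classes is omitted here for brevity):

```python
import pandas as pd

# Toy stand-in for the "typical" table; class names are illustrative only
toy = pd.DataFrame({
    "Entry": list("ABCDEFG"),
    "Taxonomic lineage (CLASS)": ["Aves"] * 5 + ["Mammalia"] * 2,
})

# Take at most three entries per class, keeping small classes whole
sample = (toy.groupby("Taxonomic lineage (CLASS)", group_keys=False)
             .apply(lambda g: g.sample(min(len(g), 3), random_state=7)))
print(len(sample))  # -> 5
```

The explicit loop in the notebook stays easier to extend with the family-level sampling rule, which is presumably why it was written out longhand.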
table1["Sampled"] = table1["Entry"].apply(lambda x: x in sampled_AC)
table1.to_excel("tables.xlsx", sheet_name="table_1", index=False)
Retrieving sampled sequences from UniProt:
for AC in sampled_AC:
    !wsl wget https://www.uniprot.org/uniprot/{AC}.fasta
!wsl cat ./*.fasta > sample.fa
!wsl rm ./*.fasta
Adjusting retrieved sequence names:
!wsl sed -i -r "/>/s/^>(sp|tr)\|(.+)\|.+$/>\2/" ./sample.fa
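The sed command above rewrites UniProt FASTA headers such as `>sp|Q12345|NAME_HUMAN description` to just the accession, `>Q12345`. The same substitution in Python, for readers less familiar with sed (the header text here is invented):

```python
import re

header = ">sp|Q12345|RASL_HUMAN Ras-like protein"  # illustrative header
# Same extended regex as the sed call: keep only the accession field
print(re.sub(r"^>(sp|tr)\|(.+)\|.+$", r">\2", header))  # -> >Q12345
```

The greedy `(.+)` is safe here because UniProt headers contain exactly two `|` separators, so the capture stops at the last one.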
The alignment has been performed with the MAFFT v7.475 program. As the domains of interest are situated quite closely to each other, in one conservative protein region, L-INS-I algorithm has been used:
In Unix cmd:
linsi sample.fa > alignment.afa
The obtained alignment has then been refined manually in the Jalview editor:
The N-terminus has been trimmed up to the 99th alignment position; the C-terminus appears conservative and has therefore been left untrimmed. Sequences 25, 39, 47, 48 and 24, 31, 42, 8 have been removed for having non-conservative N- or C-termini, respectively. Sequences containing internal insertions have been retained.
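For the record, the trimming step has a simple programmatic analogue: slice every aligned sequence from the chosen column and drop sequences whose trimmed terminus is still gap-led. A sketch on invented sequences (the trim column and gap criterion here are illustrative, not the actual Jalview decisions):

```python
# Toy aligned sequences; "-" marks alignment gaps
records = {
    "seq1": "----MKTAYR" + "A" * 20,   # gappy N-terminus
    "seq2": "MSTQEKTAYR" + "A" * 20,
}
TRIM_FROM = 2  # would be 98 for the real trim up to the 99th position

# Cut every sequence at the same column, then discard any sequence whose
# trimmed N-terminus still opens with a gap (non-conservative N-term)
trimmed = {name: seq[TRIM_FROM:] for name, seq in records.items()}
kept = {name: seq for name, seq in trimmed.items() if not seq.startswith("-")}
print(sorted(kept))  # -> ['seq2']
```

Manual inspection in Jalview remains the more reliable route, since "non-conservative" termini are judged against the whole column, not just leading gaps.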
The resulting alignment is in the refined_alignment.afa file.
The HMM profile, configured for global alignment finding, has been built with hmm2build from the HMMER 2.3.2 package:
In Unix cmd:
hmm2build -g -n Ras_SOCS ras_socs.hmm refined_alignment.afa
Then it has been calibrated with default settings:
In Unix cmd:
hmm2calibrate --cpu 1 ras_socs.hmm
To find a cutoff score with a reasonable sensitivity-to-specificity ratio for the built HMM profile, the SOCS.fasta file, containing all the SOCS_box-domain protein sequences of metazoan origin, has been retrieved from UniProtKB via the advanced search construction:
database:(type:pfam socs_box) taxonomy:"Metazoa [33208]"
Adjusting retrieved sequence names:
!wsl sed -i -r "/>/s/^>(sp|tr)\|(.+)\|.+$/>\2/" ./SOCS.fasta
Next, search for the domain architectures, encoded by the built HMM profile, has been undertaken in the retrieved proteins via hmm2search from the HMMER 2.3.2 package:
In Unix cmd:
hmm2search --cpu 1 -E 10 ras_socs.hmm SOCS.fasta > results.txt
Such a high E-value has been taken because it was previously shown that -E 0.1 yields an output only 2 sequences longer than table1, which would be of little use for profile testing (data not shown).
Adjusting the output to tsv format and loading it:
!wsl sed -r -e '1,/Parsed for domains/d' -e '/Alignments of top-scoring domains/,$d' -e '/-----/d' results.txt |\
wsl sed -r 's/ +/\t/g' | wsl sed -r -e '1s/(seq-t\t)/\1coverage_seq\t/' -e '1s/(hmm-t\t)/\1coverage_hmm\t/' > results.tsv
table2 = pd.read_csv("./results.tsv", sep="\t")
Marking the true findings in the output:
table1_AC = list(table1["Entry"])
table2["in_table1"] = table2["Sequence"].apply(lambda x: x in table1_AC)
del table1_AC
Sorting table2 by decreasing score and calculating Sensitivity and Specificity values for each score (as if it were taken as a cutoff):
table2 = table2.sort_values(by="score", ascending=False).reset_index(drop=True)
seq_num = !wsl grep -c ">" SOCS.fasta
seq_num = int(seq_num[0])
good_seq_num = table1.shape[0]
bad_seq_num = seq_num - good_seq_num
rejected_bad_seq_num = bad_seq_num - (~table2["in_table1"]).agg(np.sum)
#Sensitivity|cutoff = TP|cutoff / (TP + FN)
table2["Sensitivity"] = [table2["in_table1"][:n].agg(np.sum) / good_seq_num for n in table2.index]
#Specificity|cutoff = TN|cutoff / (TN + FP)
table2["Specificity"] = [((~table2["in_table1"][n:]).agg(np.sum) + rejected_bad_seq_num) / bad_seq_num for n in table2.index]
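The two list comprehensions above rescan the column for every row, which is O(n²); since the table is already sorted by descending score, the same curves fall out of cumulative sums. A toy equivalence check on invented counts (not the real SOCS.fasta numbers):

```python
import numpy as np
import pandas as pd

t = pd.DataFrame({"in_table1": [True, True, False, True, False]})
good, bad, rejected = 3, 4, 2  # positives; negatives; negatives absent from output

tp_above = t["in_table1"].cumsum().shift(fill_value=0)     # TP above each row's score
fp_above = (~t["in_table1"]).cumsum().shift(fill_value=0)  # FP above each row's score
sens = tp_above / good
spec = (bad - fp_above) / bad  # TN = negatives never admitted above the cutoff

# Row-by-row match with the list-comprehension definitions
ref_sens = [t["in_table1"][:n].sum() / good for n in t.index]
ref_spec = [((~t["in_table1"][n:]).sum() + rejected) / bad for n in t.index]
print(np.allclose(sens, ref_sens) and np.allclose(spec, ref_spec))  # -> True
```

For tables of this size (~1300 rows) the quadratic version is perfectly serviceable; the cumulative-sum form mainly matters for much larger scans.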
with pd.ExcelWriter("tables.xlsx", engine="openpyxl", mode="a") as writer:
    table2.to_excel(writer, sheet_name="table_2", index=False)
ROC curve:
fig = plt.figure(figsize=[8, 6])
big_ax = fig.add_axes([0, 0, 1, 1])
big_ax.plot(1 - table2["Specificity"], table2["Sensitivity"])
big_ax.grid(True)
plt.xticks(size=12)
plt.yticks(size=12)
big_ax.set_xlabel("1 - specificity", size=14)
big_ax.set_ylabel("sensitivity", size=14)
plt.annotate("", xy=(0.0001, 1), xytext=(0.00035, 0.703), arrowprops=dict(arrowstyle="-"))
plt.annotate("", xy=(0.000903, 1.005), xytext=(0.000805, 0.703), arrowprops=dict(arrowstyle="-"))
plt.title("ROC curve", size=16)
small_ax = fig.add_axes([0.4, 0.2, 0.5, 0.5])
small_ax.plot(1 - table2["Specificity"], table2["Sensitivity"])
small_ax.set_xlim(0.0001, 0.0009)
small_ax.set_ylim(0.9950, 1.0005)
plt.xticks(rotation=45)
plt.yticks(np.arange(995, 1001)/1000)
small_ax.grid(True)
plt.show()
From the curve above it is evident that only a negligibly small decrease in specificity is needed to reach 100% sensitivity. So any reasonable score cutoff resulting in 100% sensitivity would also have almost perfect specificity. Conversely, a score cutoff with almost 100% specificity would result in >95% sensitivity. The profile is inherently both very sensitive and very specific.
Defining the score cutoff:
The first finding absent from table1 (A0A3M0KYX4) has a distinctly high score and is followed by a number of findings with the target architecture (i.e. present in table1). According to the InterPro annotation, A0A3M0KYX4 has a Rab domain and a SOCS-box domain; since Rab is a subfamily of the Ras family, and many proteins from table1 are annotated as something like "Ras-related Rab protein", this protein may still be considered to have the target domain architecture. The Pfam profile for Ras seems to be sensitive to the whole family, so it is quite unexpected to find the Ras domain unidentified by Pfam in this protein. The bulk of non-homologous findings starts, however, after the score falls below 58:
table2[table2["in_table1"] == False].head(10)
However, there are some target findings with scores below 58. A0A6I9Y2F8 (score = 48.6) has been manually verified to have the target architecture, in agreement with its presence in table1. Nevertheless, the next finding, A0A7G3AK77 (score = -54.3), though present in table1, has an inverted domain architecture (SOCS+Ras) and so is not a target finding:
table2[table2["in_table1"] == True].tail(10)
There is a steep score decrease after the frequently occurring value of score = 325. However, there are a lot of target findings below this score, and the sequences with this stereotypic score of 325 all seem to belong to the large class Aves (just numerous homologues):
table2.iloc[1220:1245, :]
Score decline graph:
fig = plt.figure(figsize=[9, 7])
plt.plot(table2.index, table2["score"])
plt.plot([-25, 1300], [325, 325], linestyle="--")
plt.plot([-25, 1300], [58, 58], linestyle="--")
plt.xlim(-25, 1300)
plt.xticks(size=12)
plt.yticks(size=12)
plt.text(150, 337, "score cutoff = 325", fontsize=12, color="orange")
plt.text(150, 70, "score cutoff = 58", fontsize=12, color="green")
plt.xlabel("Finding number, sorted by score", size=14)
plt.ylabel("Score", size=14)
plt.grid(True)
plt.title("Score decline graph", size=16)
plt.show()
So two cutoffs can be defined: a strict one (score > 325), based on the steep step in the score decline graph, and a relaxed one (score > 58), based on the score of the second false finding. The relaxed cutoff seems better (more sensitive than the strict one, though equally specific), which is evident from the confusion matrices below:
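One quantitative way to compare candidate cutoffs is Youden's J statistic, J = Sensitivity + Specificity - 1: the cutoff maximising J offers the best balance of the two rates. A sketch on invented stand-in values for the real table2 columns:

```python
import pandas as pd

# Illustrative scores and rates only; the real values live in table2
toy = pd.DataFrame({
    "score":       [400, 325, 300, 58, 40],
    "Sensitivity": [0.60, 0.80, 0.95, 1.00, 1.00],
    "Specificity": [1.00, 1.00, 0.999, 0.999, 0.95],
})
toy["J"] = toy["Sensitivity"] + toy["Specificity"] - 1
print(int(toy.loc[toy["J"].idxmax(), "score"]))  # -> 58
```

On numbers shaped like these, a relaxed cutoff that trades a sliver of specificity for full sensitivity wins, consistent with the choice argued above.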
Confusion matrix for the strict cutoff (score cutoff = 325):
conf_matrix_strict = pd.DataFrame(
    {"in table_1": [table2[table2["score"] > 325]["in_table1"].agg(np.sum),
                    table2[table2["score"] <= 325]["in_table1"].agg(np.sum)],
     "not in table_1": [(~table2[table2["score"] > 325]["in_table1"]).agg(np.sum),
                        (~table2[table2["score"] <= 325]["in_table1"]).agg(np.sum)]},
    index=["predicted with hmm", "not predicted with hmm"])
conf_matrix_strict
with pd.ExcelWriter("tables.xlsx", engine="openpyxl", mode="a") as writer:
    conf_matrix_strict.to_excel(writer, sheet_name="score_cutoff_325", index=True)
Confusion matrix for the relaxed cutoff (score cutoff = 58):
conf_matrix_relaxed = pd.DataFrame(
    {"in table_1": [table2[table2["score"] > 58]["in_table1"].agg(np.sum),
                    table2[table2["score"] <= 58]["in_table1"].agg(np.sum)],
     "not in table_1": [(~table2[table2["score"] > 58]["in_table1"]).agg(np.sum),
                        (~table2[table2["score"] <= 58]["in_table1"]).agg(np.sum)]},
    index=["predicted with hmm", "not predicted with hmm"])
conf_matrix_relaxed
with pd.ExcelWriter("tables.xlsx", engine="openpyxl", mode="a") as writer:
    conf_matrix_relaxed.to_excel(writer, sheet_name="score_cutoff_58", index=True)
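The same confusion matrices can also be produced in a single call with pd.crosstab; a toy table2 stand-in (scores and labels invented):

```python
import pandas as pd

t = pd.DataFrame({"score": [400, 100, 50, 10],
                  "in_table1": [True, True, True, False]})
# Rows: whether the score clears the cutoff; columns: true membership
cm = pd.crosstab(t["score"] > 58, t["in_table1"],
                 rownames=["predicted with hmm"], colnames=["in table_1"])
print(cm.loc[True, True])  # -> 2
```

The explicit-DataFrame versions above keep full control of the row labels and row order, which matters for the Excel sheets; crosstab is just the terser alternative.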
Analysis complete